harp update March 2025

Andrew Singleton

harpIO

Arrow datasets

  • More experimentation done
  • Only for local disk
  • Inline vs offline wrangling
  • Timings vs SQLite
  • Still some edge cases to make sure to catch
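
As a rough illustration of the kind of workflow being tested, the sketch below writes a small partitioned Arrow dataset to local disk and queries it lazily with dplyr. The column names (fcst_model, lead_time, fcst, obs) and the temporary path are assumptions made up for this example; the layout harpIO actually uses may differ.

```r
library(arrow)
library(dplyr)

# Toy point-forecast table; the column names are illustrative assumptions.
fc <- data.frame(
  fcst_model = rep(c("model_a", "model_b"), each = 4),
  lead_time  = rep(c(0, 6, 12, 18), times = 2),
  fcst       = c(1.2, 2.3, 3.1, 4.0, 1.0, 2.1, 3.3, 4.4),
  obs        = c(1.0, 2.0, 3.0, 4.5, 1.0, 2.0, 3.0, 4.5)
)

# Write a dataset partitioned by model: one directory per fcst_model value.
ds_path <- file.path(tempdir(), "fc_dataset")
write_dataset(fc, ds_path, format = "parquet", partitioning = "fcst_model")

# Open lazily; no data are read until collect() is called.
ds <- open_dataset(ds_path)

# Filter and derive a column as part of the lazy query, then materialise.
res <- ds |>
  filter(fcst_model == "model_a", lead_time <= 12) |>
  mutate(bias = fcst - obs) |>
  collect()
```

Because the partition column is encoded in the directory names, a filter on fcst_model only needs to touch the matching directory, which is one reason no separate index is required.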

Timings for data as is

Timings for accumulated data

Pros and Cons

✓ Much faster than SQLite

✓ Takes about 1/3 of disk space of SQLite

✓ No need for indexes

✓ Lends itself to cloud storage (e.g. S3)


✗ Offline wrangling appears to hold on to memory that is not released afterwards

✗ First access to dataset is slow on distributed file systems

✗ File system based partitioning makes dataset navigation tough


? Code needs more work - strange hangs

harpPoint

Rewrite of verification functions

  • det_verify() and ens_verify() rewritten / refactored

  • Faster and more memory efficient - removed significant bottlenecks

  • Dedicated internal functions that do all the data wrangling

  • prep function, compute function, cont function (det), prob function (ens)

  • Score name passed in new_det/ens_score, new_det_cont_score or new_ens_prob_score; options passed in new_det/ens_score_opts

prep_ens/det_<score>()

  • Set by new_det/ens_score argument

  • Prepares data before computing the score

  • Operates on data before grouping, e.g. to create new columns

    • Bias for each case, sorting ensemble for CRPS
  • Arguments: df, fc_col, ob_col, opts

    • opts is a named list passed in new_det/ens_score_opts
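
A minimal sketch of what a prep function could look like, using the per-case bias mentioned above. The score name "mybias" is hypothetical, and the sketch assumes fc_col and ob_col arrive as column names in character strings; the harpPoint internals may pass them differently.

```r
# Hypothetical prep function for a score called "mybias".
# Assumption: fc_col and ob_col are character column names.
prep_det_mybias <- function(df, fc_col, ob_col, opts) {
  # Add the per-case bias as a new column before any grouping is done.
  df[["bias"]] <- df[[fc_col]] - df[[ob_col]]
  df
}

# Toy data to show the effect of the prep step.
toy <- data.frame(fcst = c(2, 4, 6), obs = c(1, 4, 8))
prepped <- prep_det_mybias(toy, "fcst", "obs", opts = list())
```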

compute_ens/det_<score>()

  • Set by new_det/ens_score argument

  • Computes the score for each group

  • Typically a dplyr::summarize() operation on the data frame

  • Arguments: grouped_df, show_pb, pb_env, opts

    • opts is a named list passed in new_det/ens_score_opts

    • The need for the function to include show_pb and pb_env will hopefully be removed
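
The sketch below shows one way a compute function with these arguments might reduce each group to a single value. The score name "mybias" and the bias column are hypothetical, and tick_progress() is replaced by a no-op stand-in so that the example runs outside harpPoint.

```r
library(dplyr)

# Stand-in for the progress-bar helper: a no-op used only for this sketch.
tick_progress <- function(show_pb, pb_env) invisible(NULL)

# Hypothetical compute function for a score called "mybias".
# Assumes the data already contain a per-case `bias` column.
compute_det_mybias <- function(grouped_df, show_pb, pb_env, opts) {
  fun <- function(bias, show_pb, pb_env) {
    res <- mean(bias)
    tick_progress(show_pb, pb_env)  # called once per group
    res
  }
  summarize(grouped_df, mybias = fun(.data$bias, show_pb, pb_env))
}

# Toy grouped data: two lead times with two cases each.
toy <- data.frame(
  lead_time = c(0, 0, 6, 6),
  bias      = c(1, 3, -2, 0)
)
scores <- compute_det_mybias(
  group_by(toy, lead_time), show_pb = FALSE, pb_env = NULL, opts = list()
)
```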

New contingency table score

  • Set by new_det_cont_score argument

  • Data already prepped with hit, miss, false alarm or correct rejection for each case

  • Function takes arguments: ob_prob, fc_prob, hit, false_alarm, miss, correct_rejection, show_pb, pb_env, opts.

  • opts is a named list passed in new_det_score_opts

  • Called by dplyr::summarize() internally
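
As an illustration, a function with this signature could compute the threat score (critical success index). The function name is hypothetical, tick_progress() is a no-op stand-in, and the sketch assumes that hit, false_alarm, miss and correct_rejection are per-case logical vectors for the current group.

```r
# Stand-in for the progress-bar helper: a no-op used only for this sketch.
tick_progress <- function(show_pb, pb_env) invisible(NULL)

# Hypothetical contingency-table score: threat score.
# Assumption: the four outcome arguments are per-case logical (or 0/1) vectors.
my_threat_score <- function(
  ob_prob, fc_prob, hit, false_alarm, miss, correct_rejection,
  show_pb, pb_env, opts
) {
  hits         <- sum(hit)
  misses       <- sum(miss)
  false_alarms <- sum(false_alarm)
  tick_progress(show_pb, pb_env)
  hits / (hits + misses + false_alarms)
}

# Four toy cases: one hit, one miss, one false alarm, one correct rejection.
ts <- my_threat_score(
  ob_prob = c(1, 1, 0, 0), fc_prob = c(1, 0, 1, 0),
  hit               = c(TRUE, FALSE, FALSE, FALSE),
  false_alarm       = c(FALSE, FALSE, TRUE, FALSE),
  miss              = c(FALSE, TRUE, FALSE, FALSE),
  correct_rejection = c(FALSE, FALSE, FALSE, TRUE),
  show_pb = FALSE, pb_env = NULL, opts = list()
)
```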

New ensemble probability score

  • Set by new_ens_prob_score argument

  • Data already prepped with ensemble probability and binary observed probability

  • Function takes arguments ob_prob, fc_prob, show_pb, pb_env, opts

  • opts is a named list passed in new_ens_score_opts

  • Called by dplyr::summarize() internally
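
For example, a function with this signature could compute the Brier score from the prepped probabilities. The function name is hypothetical and tick_progress() is a no-op stand-in so that the sketch runs on its own.

```r
# Stand-in for the progress-bar helper: a no-op used only for this sketch.
tick_progress <- function(show_pb, pb_env) invisible(NULL)

# Hypothetical ensemble probability score: Brier score.
# ob_prob is the binary observed probability (0 or 1) and
# fc_prob is the ensemble probability for each case.
my_brier_score <- function(ob_prob, fc_prob, show_pb, pb_env, opts) {
  tick_progress(show_pb, pb_env)
  mean((fc_prob - ob_prob)^2)
}

bs <- my_brier_score(
  ob_prob = c(1, 0, 1, 0),
  fc_prob = c(0.8, 0.2, 0.5, 0.0),
  show_pb = FALSE, pb_env = NULL, opts = list()
)
```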

Things to look out for

  • In det_cont and ens_prob functions, obs is the first argument, for consistency with the {verification} package. The order of the obs and fcst arguments might be reversed before release

  • Compute functions must include arguments and functionality to update a progress bar

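# Example compute function for a score called "myscore"
compute_det_myscore <- function(grouped_df, show_pb, pb_env, opts) {
  # Worker applied to each group's forecast and observation columns
  fun <- function(x, y, opts, show_pb, pb_env) {
    res <- (x / y) * opts$myopt
    # Advance the progress bar once per group
    tick_progress(show_pb, pb_env)
    res
  }
  dplyr::summarize(
    grouped_df,
    myscore = fun(.data$fcst, .data$obs, opts, show_pb, pb_env)
  )
}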

Verification for quantiles

  • Simply express thresholds as “q” followed by the quantile probability

    • e.g. ens_verify(..., thresholds = c("q0", "q0.25", "q0.5", "q0.75", "q1"))
  • Thresholds will be calculated from the observations for the group
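
To make the idea concrete, the sketch below converts threshold strings of this form into values taken from a group's observations. The helper name quantile_thresholds is hypothetical and this is an illustration only, not how ens_verify() does it internally.

```r
# Illustration only: turn "q<prob>" strings into thresholds computed
# from the observations of one group.
quantile_thresholds <- function(thresholds, obs) {
  # Strip the leading "q" and interpret the rest as a probability.
  probs <- as.numeric(sub("^q", "", thresholds))
  stopifnot(!anyNA(probs), all(probs >= 0 & probs <= 1))
  stats::setNames(
    stats::quantile(obs, probs = probs, names = FALSE),
    thresholds
  )
}

obs <- c(1, 2, 3, 4, 5)
th  <- quantile_thresholds(c("q0", "q0.25", "q0.5", "q0.75", "q1"), obs)
```

Since the quantiles are taken within each group, the same "q0.75" string can correspond to a different physical threshold in each group.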

New ensemble score: tw_crps

  • Threshold weighted CRPS

  • Computes CRPS emphasizing cases that fall within the threshold

  • “Clamps” data outside the threshold of interest
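
The clamping idea can be sketched for a single case as follows. This assumes the region of interest is values at or above the threshold, so that ensemble members and the observation below it are clamped to the threshold before an ordinary ensemble CRPS is computed. The function names are hypothetical and this is not necessarily how tw_crps is implemented in harpPoint.

```r
# Standard ensemble CRPS estimator for one case:
# mean |x_i - y| minus half the mean |x_i - x_j| over all member pairs.
ens_crps <- function(ens, obs) {
  mean(abs(ens - obs)) - 0.5 * mean(abs(outer(ens, ens, "-")))
}

# Threshold-weighted CRPS sketch.
# Assumption: interest lies in values >= threshold, so everything below
# the threshold is clamped to it before computing the CRPS.
tw_crps_sketch <- function(ens, obs, threshold) {
  ens_crps(pmax(ens, threshold), max(obs, threshold))
}

ens <- c(1, 2, 3)
obs <- 2

crps_plain <- ens_crps(ens, obs)                       # no weighting
crps_tw    <- tw_crps_sketch(ens, obs, threshold = 2)  # emphasise values >= 2
crps_none  <- tw_crps_sketch(ens, obs, threshold = 5)  # everything clamped
```

When all members and the observation lie below the threshold, everything is clamped to the same value and the case contributes zero, which is how the score ignores cases outside the range of interest.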

Threshold weighted CRPS


harpVis

Some new functionality

  • Better plotting for spatial scores [James]

  • Mean forecast and mean observation values now available in plots

harp

Functionalities for batch running

  • Still some testing to do

  • Documentation needs writing

  • Won’t be worked on until after Easter